Purpose of the data set: This dataset was supplied by the City of Los Angeles. It contains crime data from 2010 to 2019 and uses the FBI's Uniform Crime Reporting (UCR) standard to summarize incident information in a repeatable way that is comparable across cities, counties, and states. The LA City Mayor wrote that the data is published for the sake of transparency, and to encourage those outside the government to use it for innovation and problem solving. [Ref: https://data.lacity.org/]
Data Source: The data set used for the purposes of this project is sourced from LOS ANGELES OPEN DATA.
Data Importance: Analyzing crime data over several years helps identify high-risk populations and areas so that corrective action can be taken to reduce crime and make the city safer.
Mining useful knowledge and measuring effectiveness of prediction algorithm:
For this data set we would like to predict the following features:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import descartes
import plotly.express as px
#!pip install geopandas
import geopandas as gpd
from shapely.geometry import Point, Polygon
%matplotlib inline
# Load the LA Crimes data set into pandas dataframe
df = pd.read_csv("Data/Crime_Data_from_2010_to_2019.csv")
# List the total number of rows and columns in the dataframe
print("Total number of rows in the dataframe: " + str(df.shape[0]))
print("Total number of columns in the dataframe: " + str(df.shape[1]))
# Displaying the data types of each column/attribute
df.info()
Upon committing to GitHub we discovered that the dataset exceeds GitHub's 100 MB file limit. To make this notebook runnable from GitHub, we will use the API provided by LA City. However, the JSON response does not resolve to the same data types as the CSV, so transformations are needed to coerce these fields to match the CSV dataset we were able to import locally.
from sodapy import Socrata
# Unauthenticated client only works with public data sets. Note 'None'
# in place of application token, and no username or password:
client = Socrata("data.lacity.org", None)
# Example authenticated client (needed for non-public datasets):
# client = Socrata(data.lacity.org,
# MyAppToken,
# username="user@example.com",
# password="AFakePassword")
# First 10000 results, returned as JSON from the API / converted to a Python list of
# dictionaries by sodapy.
results = client.get("63jg-8b9z", limit=10000)
# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)
results_df.info()
results_df['lat'] = results_df['lat'].astype(float)
results_df['lon'] = results_df['lon'].astype(float)
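The lat/lon coercion above generalizes to the other fields. As a sketch (the column names in the toy frame are assumptions, following Socrata's lowercase field naming), numeric and date columns could be coerced in one pass:

```python
import pandas as pd

def coerce_api_types(df, numeric_cols, datetime_cols):
    """Coerce string columns from the JSON API response to match the CSV dtypes."""
    out = df.copy()
    for col in numeric_cols:
        out[col] = pd.to_numeric(out[col], errors="coerce")
    for col in datetime_cols:
        out[col] = pd.to_datetime(out[col], errors="coerce")
    return out

# Small stand-in frame; in practice this would be results_df
sample = pd.DataFrame({"vict_age": ["25", "30"],
                       "lat": ["34.05", "33.99"],
                       "date_occ": ["2015-01-01T00:00:00", "2016-06-15T00:00:00"]})
typed = coerce_api_types(sample, ["vict_age", "lat"], ["date_occ"])
```

Invalid strings become NaN/NaT rather than raising, which mirrors how the missing values in the CSV behave.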
Based on the outputs above we have the following observations for the LA Crimes data set:
A detailed description of each attribute along with its meaning and data type are displayed in a table below.
Description information in the table below is sourced from LOS ANGELES OPEN DATA.
# Load Data Description file into pandas dataframe
data_desc = pd.read_csv('Data/Data_Description.csv')
from IPython.display import display, HTML
display(HTML(data_desc.to_html()))
Additional details about the data set can be found in the links below:
Link to Modus Operandi codes
Link to LAPD Reporting Districts
# Displaying the first 2 lines of the dataframe
df.head(2)
# Summary of attributes in the dataframe
df.describe()
# Display missing values in the Column "Vict Descent"
print("Missing values in the Column 'Vict Descent' are: " + str(df['Vict Descent'].isnull().sum()))
# Display number of victims grouped by their descent
# Descent Code:
# A: Other Asian, B: Black, C: Chinese, D: Cambodian, F: Filipino, G: Guamanian, H: Hispanic/Latin/Mexican,
# I: American Indian/Alaskan Native, J: Japanese, K: Korean, L: Laotian, O: Other, P: Pacific Islander,
# S: Samoan, U: Hawaiian, V: Vietnamese, W: White, X: Unknown, Z: Asian Indian
print()
print("Number of victims grouped by their descent are as below:")
df.groupby('Vict Descent').size()
Based on the outputs above we have the following observations for the 'Vict Descent' column:
Based on further analysis of a random selection of records, we concluded that these values are likely data entry errors. Hence, we can replace those records with Descent Code 'X', which represents the Unknown category.
# Replace records in the 'Vict Descent' column having '-' with 'X'
df['Vict Descent'] = df['Vict Descent'].replace(to_replace='-',value='X')
# Replace records in the 'Vict Descent' column having blanks with 'X'
df['Vict Descent'].fillna('X', inplace = True)
# Display missing values in the Column "Vict Descent" and also display the number of victims grouped by their descent
print("Missing values in the Column 'Vict Descent' are: " + str(df['Vict Descent'].isnull().sum()))
print()
print("Number of victims grouped by their descent are as below:")
df.groupby('Vict Descent').size()
As seen in the output above, the Vict Descent category 'X' now has 274905 records.
# Display missing values in the Column "Vict Sex"
print("Missing values in the Column 'Vict Sex' are: " + str(df['Vict Sex'].isnull().sum()))
# Display number of victims grouped by their Sex
# Sex Code:
# F: Female, M: Male, X: Unknown
print()
print("Number of victims grouped by their Sex are as below:")
df.groupby('Vict Sex').size()
Based on the outputs above we have the following observations for the 'Vict Sex' column:
Based on further analysis of a random selection of records, we concluded that these values are likely data entry errors. Hence, we can replace those records with Sex Code 'X', which represents the Unknown category.
# Replace records in the 'Vict Sex' column having '-' with 'X'
df['Vict Sex'] = df['Vict Sex'].replace(to_replace='-',value='X')
# Replace records in the 'Vict Sex' column having 'H' with 'X'
df['Vict Sex'] = df['Vict Sex'].replace(to_replace='H',value='X')
# Replace records in the 'Vict Sex' column having 'N' with 'X'
df['Vict Sex'] = df['Vict Sex'].replace(to_replace='N',value='X')
# Replace records in the 'Vict Sex' column having blanks with 'X'
df['Vict Sex'].fillna('X', inplace = True)
# Display missing values in the Column "Vict Sex" and also display the number of victims grouped by their Sex
print("Missing values in the Column 'Vict Sex' are: " + str(df['Vict Sex'].isnull().sum()))
print()
print("Number of victims grouped by their Sex are as below:")
df.groupby('Vict Sex').size()
As seen in the output above, the Vict Sex category 'X' now has 251927 records.
# Display missing values in the Column "Vict Age"
print("Missing values in the Column 'Vict Age' are: " + str(df['Vict Age'].isnull().sum()))
# Display number of victims grouped by their Age
print()
print("Number of victims grouped by their Age are as below:")
df.groupby('Vict Age').size()
# Display number of victims grouped by their Age, where Age value is less than or equal to zero
print()
print("Number of victims grouped by their Age, where Age value is less than or equal to zero, are as below:")
df[df['Vict Age']<=0].groupby('Vict Age').size()
# Display total number of records with invalid age values
# Age values zero and less are considered invalid age values
print()
print("Total number of records with invalid age values are: " + str( df[df['Vict Age']<=0]['Vict Age'].count() ))
# Display number of victims grouped by their Age, where Age value is greater than or equal to 100
print()
print("Number of victims grouped by their Age, where Age value is greater than or equal to 100, are as below:")
df[df['Vict Age']>=100].groupby('Vict Age').size()
# Display overall median victim age value
print()
print("Overall median victim age value in this data set is: " + str( df['Vict Age'].median() ))
# Display median victim age value of children (>0 & <18)
print()
print("Median victim age value of children (>0 & <18) in this data set is: " +
str( df[(df['Vict Age']>0) & (df['Vict Age']<18)]['Vict Age'].median() ))
# Display number of records grouped by 'Crime Code Description' categories that
# contain the word "CHILD" and with invalid age value less than or equal to zero
df[ (df['Crm Cd Desc'].str.contains("CHILD")) & (df['Vict Age']<=0) ].groupby('Crm Cd Desc').size()
# Display number of records grouped by 'Vict Age' that
# contain the word "CHILD" and with invalid age value less than or equal to zero
df[ (df['Crm Cd Desc'].str.contains("CHILD")) & (df['Vict Age']<=0) ].groupby('Vict Age').size()
# Display total number of records that contain the word "CHILD" in the Crime Code Desc and
# have an invalid age value less than or equal to zero
temp = df[ (df['Crm Cd Desc'].str.contains("CHILD")) & (df['Vict Age']<=0) ]['Vict Age'].count()
print("Total number of records that contain the word 'CHILD' in the Crime Code Desc and have an invalid age value less than or equal to zero are: " + str(temp))
Based on the outputs above we have the following observations for the 'Vict Age' column:
Based on further analysis of a random selection of records, we concluded that these invalid age values are likely data entry errors.
Hence, we perform the imputations below:
# Replace records that contain the word 'CHILD' in the Crime Code Desc and
# have an invalid age value less than or equal to zero, with
# the Median victim age value for children which is 14.
# As seen in the outputs above, for this particular category the invalid age values are [0, -1, -2, -3]
temp = df[ (df['Crm Cd Desc'].str.contains("CHILD")) & (df['Vict Age']<=0) ]['Vict Age']
temp = temp.replace(to_replace=[0, -1, -2, -3],value=14)
df.loc[ ( (df['Crm Cd Desc'].str.contains("CHILD")) & (df['Vict Age']<=0) ), 'Vict Age'] = temp
# Now replace the remaining records that have an invalid age value less than or
# equal to zero, with the Overall median victim age value which is 32.
# As seen in the outputs above, for this particular category the invalid age values
# are [0, -1, -2, -3, -4, -5, -6, -7, -8, -9]
df.loc[ df['Vict Age']<=0, 'Vict Age' ] = 32
# Display total number of records with invalid age values
# Age values zero and less are considered invalid age values
print()
print("Total number of records with invalid age values are: " + str( df[df['Vict Age']<=0]['Vict Age'].count() ))
# Display number of victims grouped by their Age
print()
print("Number of victims grouped by their Age are as below:")
df.groupby('Vict Age').size()
As seen in the output above, there are now NO records with invalid age values.
# Display missing values in Columns 'LAT' and 'LON'
print()
print("Missing values in the Column 'LAT' are: " + str(df['LAT'].isnull().sum()))
print("Missing values in the Column 'LON' are: " + str(df['LON'].isnull().sum()))
# Display records grouped by LAT values
print()
print("Records grouped by Latitude values:")
df.groupby('LAT').size()
# Display records grouped by LON values
print()
print("Records grouped by Longitude values:")
df.groupby('LON').size()
# Display mean LAT and LON values
print()
mean_LAT = df['LAT'].mean()
mean_LON = df['LON'].mean()
print( "Mean latitude value in the current data set is: " + str(mean_LAT) )
print( "Mean longitude value in the current data set is: " + str(mean_LON) )
Based on the outputs above we have the following observations for the 'LAT' and 'LON' columns:
According to Google Maps, the latitude of Los Angeles, CA, USA is 34.052235 and the longitude is -118.243683, which indicates that LAT and LON values of 0.0000 are invalid.
Based on further analysis of a random selection of records, we concluded that these invalid values are likely data entry errors.
Hence, we replace these invalid values with the respective 'mean' values.
# Replace invalid 'LAT' and 'LON' values with their mean values respectively
df.loc[ df['LAT']==0, 'LAT' ] = mean_LAT
df.loc[ df['LON']==0, 'LON' ] = mean_LON
# Display min and max values of 'LAT' and 'LON' columns
print()
print( "Min LAT is: " + str(df['LAT'].min()) + " and " + "Max LAT is: " + str(df['LAT'].max()) )
print( "Min LON is: " + str(df['LON'].min()) + " and " + "Max LON is: " + str(df['LON'].max()) )
As seen in the output above, there are now NO records with invalid LAT and LON values.
# Cosmetic correction
# When exploring the column name for AREA, a trailing space was found.
# Instead of having the column name as 'AREA' we have 'AREA '
# The code in this cell is to remove the trailing space
df = df.rename(columns = {"AREA ":"AREA"})
# Display records grouped by AREA and their corresponding AREA NAME
print()
print("Records grouped by 'AREA' and their corresponding 'AREA NAME':")
df.groupby(['AREA', 'AREA NAME']).size()
# Identify missing values in 'AREA' and 'AREA NAME' columns
print()
print( "Missing values in column 'AREA' are: " + str(df['AREA'].isnull().sum()) )
print( "Missing values in column 'AREA NAME' are: " + str(df['AREA NAME'].isnull().sum()) )
Based on the outputs above we have the following observations for 'AREA' and 'AREA NAME' columns:
Hence no modifications would be made to these columns.
# Identify missing values in 'Status' and 'Status Desc' columns
print()
print( "Missing values in column 'Status' are: " + str(df['Status'].isnull().sum()) )
print( "Missing values in column 'Status Desc' are: " + str(df['Status Desc'].isnull().sum()) )
# Display records grouped by 'Status' and their corresponding 'Status Desc'
print()
print("Records grouped by 'Status' and their corresponding 'Status Desc':")
df.groupby(['Status', 'Status Desc']).size()
Based on the outputs above we have the following observations for the 'Status' and 'Status Desc' columns:
Based on further analysis, by looking at the actual records, we found the following:
Hence, we concluded that these invalid values are likely data entry errors, and we replace them with the Status code 'IC' and Status Desc 'Invest Cont', which are the default values for these fields.
# Replace 'UNK' in 'Status Desc' column with 'Invest Cont'
df.loc[ df['Status Desc']=='UNK', 'Status Desc' ] = 'Invest Cont'
# Replace records with status codes '13', '19', 'TH' and 'CC' with 'IC'
df.loc[ df['Status']=='13', 'Status' ] = 'IC'
df.loc[ df['Status']=='19', 'Status' ] = 'IC'
df.loc[ df['Status']=='TH', 'Status' ] = 'IC'
df.loc[ df['Status']=='CC', 'Status' ] = 'IC'
# Replace blanks in 'Status' column with 'IC'
df['Status'].fillna('IC', inplace = True)
# Identify missing values in 'Status' and 'Status Desc' columns
print()
print( "Missing values in column 'Status' are: " + str(df['Status'].isnull().sum()) )
print( "Missing values in column 'Status Desc' are: " + str(df['Status Desc'].isnull().sum()) )
# Display records grouped by 'Status' and their corresponding 'Status Desc'
print()
print("Records grouped by 'Status' and their corresponding 'Status Desc':")
df.groupby(['Status', 'Status Desc']).size()
As seen in the output above, there are now NO records with invalid 'Status' and 'Status Desc' values.
# Identify missing values in 'Crm Cd' and 'Crm Cd Desc' columns
print()
print( "Missing values in column 'Crm Cd' are: " + str(df['Crm Cd'].isnull().sum()) )
print( "Missing values in column 'Crm Cd Desc' are: " + str(df['Crm Cd Desc'].isnull().sum()) )
# Display records grouped by 'Crm Cd' and their corresponding 'Crm Cd Desc'
print()
print("Records grouped by 'Crm Cd' and their corresponding 'Crm Cd Desc':")
df.groupby(['Crm Cd', 'Crm Cd Desc']).size()
As seen above there are NO errors in 'Crm Cd' and 'Crm Cd Desc' columns.
# Identify missing values in 'Weapon Used Cd' and 'Weapon Desc' columns
print()
print( "Missing values in column 'Weapon Used Cd' are: " + str(df['Weapon Used Cd'].isnull().sum()) )
print( "Missing values in column 'Weapon Desc' are: " + str(df['Weapon Desc'].isnull().sum()) )
# As seen in the output above, there is one record that does not have a 'Weapon Desc' but has a 'Weapon Used Cd'
# Display the one record that does not have a 'Weapon Desc' but has a 'Weapon Used Cd'
df[ (df['Weapon Desc'].isnull()) & (df['Weapon Used Cd'].notnull()) ]
# Display records grouped by 'Weapon Used Cd' and their corresponding 'Weapon Desc'
print()
print("Records grouped by 'Weapon Used Cd' and their corresponding 'Weapon Desc':")
df.groupby(['Weapon Used Cd', 'Weapon Desc']).size()
Based on the outputs above we have the following observations for 'Weapon Used Cd' and 'Weapon Desc' columns:
There is one record that does not have a 'Weapon Desc' but has a 'Weapon Used Cd'. Looking closely at the record output it is clear that the crime was committed with a 'DEADLY WEAPON'. So we will conclude that the description of the weapon was not available. Since this is just one record, we will leave the record as is.
# Identify missing values in 'Premis Cd' and 'Premis Desc' columns
print()
print( "Missing values in column 'Premis Cd' are: " + str(df['Premis Cd'].isnull().sum()) )
print( "Missing values in column 'Premis Desc' are: " + str(df['Premis Desc'].isnull().sum()) )
# Display records grouped by 'Premis Cd' and their corresponding 'Premis Desc'
print()
print("Records grouped by 'Premis Cd' and their corresponding 'Premis Desc':")
df.groupby(['Premis Cd', 'Premis Desc']).size()
# Identify missing values in 'DR_NO', 'Date Rptd', 'DATE OCC', 'TIME OCC', 'Rpt Dist No', 'Part 1-2' columns
print()
print( "Missing values in column 'DR_NO' are: " + str(df['DR_NO'].isnull().sum()) )
print( "Missing values in column 'Date Rptd' are: " + str(df['Date Rptd'].isnull().sum()) )
print( "Missing values in column 'DATE OCC' are: " + str(df['DATE OCC'].isnull().sum()) )
print( "Missing values in column 'TIME OCC' are: " + str(df['TIME OCC'].isnull().sum()) )
print( "Missing values in column 'Rpt Dist No' are: " + str(df['Rpt Dist No'].isnull().sum()) )
print( "Missing values in column 'Part 1-2' are: " + str(df['Part 1-2'].isnull().sum()) )
# Display min and max values from 'TIME OCC' column so as to verify if the time values had an errors
print()
print("Minimum value in 'TIME' column: " + str(df['TIME OCC'].min()) )
print("Maximum value in 'TIME' column: " + str(df['TIME OCC'].max()) )
As seen above there are NO errors in the 'TIME OCC' column. The time values fall within the range 0000 to 2359 (military time).
# Identify duplicate rows in the entire data set, using information from all the columns
# The below code marks duplicates as 'True' except for the first occurrence.
# df.duplicated(subset=None, keep='first')
print()
print("Grouping non-duplicates into False bucket and duplicates into True bucket:")
df.groupby( [df.duplicated(subset=None, keep='first')] ).size()
As seen in the output above, there are NO duplicate rows in the current data set.
From the outputs of "df.info()" and "display(HTML(data_desc.to_html()))" given above, we conclude that there are NO duplicate columns in the current data set.
However, it is important to note that the information in columns "Crm Cd" and "Crm Cd 1" is basically the same. So when building models we would make use of only one of these columns.
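The claimed redundancy between "Crm Cd" and "Crm Cd 1" can be checked directly. A minimal sketch on a toy frame (the real check would run on `df`):

```python
import pandas as pd

# Toy frame standing in for the full dataset; column names follow the data dictionary
df_demo = pd.DataFrame({"Crm Cd":   [510, 330, 624],
                        "Crm Cd 1": [510, 330, 624]})

# Fraction of rows where the two codes agree; a value near 1.0 supports
# keeping only one of the columns when building models.
agreement = (df_demo["Crm Cd"] == df_demo["Crm Cd 1"]).mean()
```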
When analyzing each individual column in the data set the only outliers we encountered were two individuals with age 114 and 118. The same is graphically represented in the box-plot below.
Please refer to section "Exploring Column: Vict Age" for more details.
# Box plot to display outliers in the 'Vict Age' column.
df.boxplot(column='Vict Age', return_type='axes');
# Find Missing values in each column
df.isnull().sum()
# Summary of attributes in the dataframe
df.describe()
Victim age appears to peak in the thirties and decline from there.
df['Vict Age'].mean()
df.hist(column='TIME OCC')
# Extracting Year of Crime as an attribute
df['year'] = pd.DatetimeIndex(df['DATE OCC']).year
# Extracting Month of Crime as an attribute
df['month'] = pd.DatetimeIndex(df['DATE OCC']).month_name()
plt.style.use('ggplot')
df_area = df.groupby(by=['AREA NAME'])
area_crime_count = df_area['AREA NAME'].count()
area_crime_count.sort_values().plot.barh(title= 'Crime Reported Over Last 10 Years')
df['DayHr'] = pd.cut(df['TIME OCC'],
                     bins=[0,59,159,259,359,459,559,659,759,859,959,1059,1159,1259,1359,1459,1559,1659,1759,1859,1959,2059,2159,2259,2359],
                     labels=[str(i) for i in range(1, 25)])
df_dayhr_tmp = df.groupby('DayHr').count().reset_index().iloc[:,[0,1]]
df_dayhr = df_dayhr_tmp.rename(columns={df_dayhr_tmp.columns[0]: "HrOfDay", df_dayhr_tmp.columns[1]: "Count"})
df_dayhr
plt.bar(df_dayhr.HrOfDay,df_dayhr.Count)
plt.title('Crime Count by the hour of Day')
plt.xlabel('Hour of Day')
plt.ylabel('Total number of crimes reported')
# Bin 'TIME OCC' into six 4-hour windows of the day
df['DayType'] = pd.cut(df['TIME OCC'],
                       bins=[0,359,759,1159,1559,1959,2359],
                       labels=['00hrs - 04hrs','04hrs - 08hrs','08hrs - 12hrs','12hrs - 16hrs','16hrs - 20hrs','20hrs - 00hrs'])
df_daytime_tmp = df.groupby('DayType').count().reset_index().iloc[:,[0,1]]
df_daytime = df_daytime_tmp.rename(columns={df_daytime_tmp.columns[0]: "TimeOfDay", df_daytime_tmp.columns[1]: "Count"})
ax1 = plt.pie(df_daytime.Count,labels=df_daytime.TimeOfDay,autopct='%1.1f%%')
plt.title('% Crime Reports in 4hrs window')
df_crimecode_grp = df.groupby(by=['Crm Cd','Crm Cd Desc']).count().reset_index().iloc[:,[0,1,2]]
df_crimecode = df_crimecode_grp.rename(columns={df_crimecode_grp.columns[0]: "Crime Code", df_crimecode_grp.columns[1]: "Description",df_crimecode_grp.columns[2]: "Count"})
df_crimecode['%']=100*df_crimecode.Count/df_crimecode.Count.sum()
df_crimecode.sort_values(['%'],inplace=True,ascending=False)
df_crimecode['Cum %']=df_crimecode['%'].cumsum()
#plt.bar(df_crimecode.head(10)['Crime Code'].apply(str),df_crimecode.head(10)['%'],align='center')
plt.barh(df_crimecode.head(10)['Description'],df_crimecode.head(10)['%'],align='center')
plt.xlabel('% Crime Reports')
plt.title('% Top 10 Crime Types')
plt.gca().invert_yaxis()
df_premiscode_grp = df.groupby(by=['Premis Cd','Premis Desc']).count().reset_index().iloc[:,[0,1,2]]
df_premiscode = df_premiscode_grp.rename(columns={df_premiscode_grp.columns[0]: "Premis Code", df_premiscode_grp.columns[1]: "Description",df_premiscode_grp.columns[2]: "Count"})
df_premiscode['%']=100*df_premiscode.Count/df_premiscode.Count.sum()
df_premiscode.sort_values(['%'],inplace=True,ascending=False)
df_premiscode['Cum %']=df_premiscode['%'].cumsum()
#plt.bar(df_crimecode.head(10)['Crime Code'].apply(str),df_crimecode.head(10)['%'],align='center')
plt.barh(df_premiscode.head(10)['Description'],df_premiscode.head(10)['%'],align='center')
plt.gca().invert_yaxis()
plt.xlabel('% Crime Reports')
plt.title('% Top 10 Crime Premises')
This histogram of victims by ages lets us see that there is a right skew to the distribution of ages. The peak is around the earlier noted mean of 37.
df.hist(column='Vict Age')
plt.xlabel('Age')
plt.title('Victim Counts by Age')
This was largely developed using a Geopandas tutorial found on [Towards Data Science](https://towardsdatascience.com/geopandas-101-plot-any-data-with-a-latitude-and-longitude-on-a-map-98e01944b972) and using shape data of the City Boundaries from LA City Geo Hub. These shape files consist of many points and give the background of our map. This allows us to take the geolocation data and overlay it on a map of the LA City Boundaries.
#create a geopandas dataframe and convert lat/long to point geometry:
geometry = [Point(xy) for xy in zip(df["LON"], df["LAT"])]
geometry[:3]
# Tell GeoPandas that the coordinates are latitude/longitude (WGS 84)
crs = "EPSG:4326"
geo_df = gpd.GeoDataFrame(df, crs=crs, geometry=geometry)
geo_df.head(2)
geo_df[(geo_df['Vict Age']>=18) & (geo_df['Vict Age']<40)]
This shape file from the LA City Geohub allows us to create a map of LA to plot the locational Longitude/Latitude coordinates
#Read in the Shape file to create a general outline of the LA City Boundaries
la_map = gpd.read_file('Data/shape/City_Boundaries.shp')
#fig,ax = plt.subplots(figsize=(15,15))
#la_map.plot(ax=ax)
Looking at the overall boundaries, it is apparent that crime occurs across all age groups, concentrated in the denser city limits toward the lower left.
#Finally we are able to plot the data on the map
fig,ax = plt.subplots(figsize = (15,15))
la_map.plot(ax= ax, alpha = 0.4, color = "grey")
#starting with the largest group and decreasing to limit as much overlay as possible
geo_df[(geo_df['Vict Age']>=18) & (geo_df['Vict Age']<=40)].plot(ax = ax, markersize = 10, color = "red", marker = 'o', label = 'Victim Aged 18 - 40')
geo_df[geo_df['Vict Age']>40].plot(ax = ax, markersize = 10, color = "orange", marker = 'o', label = 'Victim Over 40')
geo_df[geo_df['Vict Age']<18].plot(ax = ax, markersize = 10, color = "blue", marker = 'o', label = 'Victim Under 18')
plt.title('LA Crime By General Age groups \n (Under 18, 18 to 40, and Over 40)')
plt.legend(prop={'size':15})
#Same plot, but looking to see if there are Sex specific areas that appear
fig,ax = plt.subplots(figsize = (15,15))
la_map.plot(ax= ax, alpha = 0.4, color = "grey")
geo_df[(geo_df['Vict Sex']=='M')].plot(ax = ax, markersize = 10, color = "blue", marker = 'o', label = 'Male Victim')
geo_df[geo_df['Vict Sex']=='F'].plot(ax = ax, markersize = 10, color = "red", marker = 'o', label = 'Female Victim')
plt.title('LA Crime \n (Males Vs Females)')
plt.legend(prop={'size':15})
df_ad_grp = df.groupby(['AREA NAME', 'Vict Descent']).count().reset_index().iloc[:,[0,1,2]]
df_ad = df_ad_grp.rename(columns={df_ad_grp.columns[0]: "Area", df_ad_grp.columns[1]: "Descent", df_ad_grp.columns[2]: "Count"})
fig = px.treemap(df_ad, path=['Area', 'Descent'],values='Count')
fig.update_layout(title="Crime Area & Victim Descent Treemap",width=800, height=500,title_x=0.5)
fig.show()
df_area_yr_grp = df.groupby(by=['AREA NAME','year']).count().reset_index().iloc[:,[0,1,2]]
df_area_yr = df_area_yr_grp.rename(columns={df_area_yr_grp.columns[0]: "Area", df_area_yr_grp.columns[1]: "Year",df_area_yr_grp.columns[2]: "Count"})
df_area_yr_pivot = df_area_yr.pivot(index='Area', columns='Year', values='Count').transpose()
# Plot each policing area's yearly crime counts; 'Pacific' and 'Central' are highlighted
areas = ['77th Street', 'Southwest', 'N Hollywood', 'Pacific', 'Southeast', 'Mission',
         'Northeast', 'Newton', 'Van Nuys', 'Hollywood', 'Central', 'Topanga',
         'Devonshire', 'Olympic', 'Harbor', 'Rampart', 'West Valley', 'West LA',
         'Wilshire', 'Foothill', 'Hollenbeck']
for area in areas:
    kwargs = {}
    if area == 'Pacific':
        kwargs['linewidth'] = 3
    elif area == 'Central':
        kwargs['linewidth'] = 3
        kwargs['color'] = 'red'
    plt.plot(df_area_yr_pivot.index.values, df_area_yr_pivot[area], marker='o', label=area, **kwargs)
plt.ylim(5000,20000)
plt.legend(loc='lower center',bbox_to_anchor=(1.2, -0.3))
plt.xlabel('Year')
plt.ylabel('Number of Crime Reports')
plt.title('LA Crime Over 10 Years \n (Different Policing Areas)')
df_sex_yr_grp = df.groupby(by=['Vict Sex','year']).count().reset_index().iloc[:,[0,1,2]]
df_sex_yr = df_sex_yr_grp.rename(columns={df_sex_yr_grp.columns[0]: "Sex", df_sex_yr_grp.columns[1]: "Year",df_sex_yr_grp.columns[2]: "Count"})
df_sex_yr_pivot = df_sex_yr.pivot(index='Sex', columns='Year', values='Count').transpose()
plt.plot( df_sex_yr_pivot.index.values,df_sex_yr_pivot['M'],marker='o',label= 'M',linewidth=2,color="blue")
plt.plot( df_sex_yr_pivot.index.values,df_sex_yr_pivot['F'],marker='o',label= 'F',linewidth=2,color="red")
plt.ylim(50000,120000)
plt.legend(loc='lower center',bbox_to_anchor=(1.2,0.3))
plt.xlabel('Year')
plt.ylabel('Number of Crime Reports')
plt.title('LA Crime Over 10 Years \n (Males Vs Females)')
df_cent_crcd_yr_grp = df.loc[df['AREA NAME'] == 'Central'].groupby(by=['Crm Cd','Crm Cd Desc', 'year']).count().reset_index().iloc[:,[0,1,2,3]]
df_sex_yr_grp = df.groupby(by=['Vict Sex','year']).count().reset_index().iloc[:,[0,1,2]]
df_cent_crcd_yr = df_cent_crcd_yr_grp.rename(columns={df_cent_crcd_yr_grp.columns[0]: "CrimeCode", df_cent_crcd_yr_grp.columns[1]: "Description",df_cent_crcd_yr_grp.columns[2]: "Year",df_cent_crcd_yr_grp.columns[3]: "Count"})
df_cent_crcd_yr_pivot = df_cent_crcd_yr.pivot(index='Description', columns='Year', values='Count').transpose()
for (columnName, columnData) in df_cent_crcd_yr_pivot.items():
    incThresh = float(columnData.values[9]) / float(columnData.values[4])
    if incThresh > 10:
        plt.plot(df_cent_crcd_yr_pivot.index.values, columnData.values/columnData.values[4], marker='8', label=str(columnName), linewidth=4, color='purple')
    elif incThresh > 5:
        plt.plot(df_cent_crcd_yr_pivot.index.values, columnData.values/columnData.values[4], marker='s', label=str(columnName), linewidth=3, color='red')
    elif incThresh > 4:
        plt.plot(df_cent_crcd_yr_pivot.index.values, columnData.values/columnData.values[4], marker='o', label=str(columnName), linewidth=2, linestyle='--', color='blue')
    elif incThresh > 3:
        plt.plot(df_cent_crcd_yr_pivot.index.values, columnData.values/columnData.values[4], marker='*', label=str(columnName), linewidth=1, linestyle='-.', color='black')
    elif incThresh > 2:
        plt.plot(df_cent_crcd_yr_pivot.index.values, columnData.values/columnData.values[4], marker='d', label=str(columnName), linewidth=0.5, linestyle=':', color='grey')
plt.ylim(0.1,100)
plt.yscale('log')
plt.legend(loc='right',bbox_to_anchor=(2.2,0.5))
plt.xlabel('Year')
plt.ylabel('Crime reports normalized to yr 2014')
plt.title('Increasing Crime Types in Central Area')
df_cent_sex_yr_grp = df.loc[df['AREA NAME'] == 'Central'].groupby(by=['Vict Sex','year']).count().reset_index().iloc[:,[0,1,2]]
df_cent_sex_yr = df_cent_sex_yr_grp.rename(columns={df_cent_sex_yr_grp.columns[0]: "Sex", df_cent_sex_yr_grp.columns[1]: "Year",df_cent_sex_yr_grp.columns[2]: "Count"})
df_cent_sex_yr_pivot = df_cent_sex_yr.pivot(index='Sex', columns='Year', values='Count').transpose()
#df_cent_sex_yr_pivot
plt.plot( df_cent_sex_yr_pivot.index.values,df_cent_sex_yr_pivot['M'],marker='o',label= 'M',linewidth=2,color="blue")
plt.plot( df_cent_sex_yr_pivot.index.values,df_cent_sex_yr_pivot['F'],marker='o',label= 'F',linewidth=2,color="red")
#plt.ylim(50000,120000)
plt.legend(loc='lower center',bbox_to_anchor=(1.2,0.3))
plt.xlabel('Year')
plt.ylabel('Number of Crime Reports')
plt.title('LA Crime In Central Area \n (Males Vs Females)')
df_cent_des_yr_grp = df.loc[df['AREA NAME'] == 'Central'].groupby(by=['Vict Descent','year']).count().reset_index().iloc[:,[0,1,2]]
df_cent_des_yr = df_cent_des_yr_grp.rename(columns={df_cent_des_yr_grp.columns[0]: "Descent", df_cent_des_yr_grp.columns[1]: "Year", df_cent_des_yr_grp.columns[2]: "Count"})
df_cent_des_yr_pivot = df_cent_des_yr.pivot(index='Descent', columns='Year', values='Count').transpose()
#df_cent_des_yr_pivot
for (columnName, columnData) in df_cent_des_yr_pivot.items():
    plt.plot(df_cent_des_yr_pivot.index.values, columnData.values/columnData.values[4], marker='o', label=str(columnName), linewidth=2)
plt.ylim(0.1,50)
plt.yscale('log')
plt.legend(loc='right',bbox_to_anchor=(1.2,0.5))
plt.xlabel('Year')
plt.ylabel('Crime reports normalized to yr 2014')
plt.title('Crime Trends in Central Area \n by Victim Descent')
df_yr_month_grp = df.groupby(['year', 'month']).count().reset_index().iloc[:,[0,1,2]]
df_yr_month = df_yr_month_grp.rename(columns={df_yr_month_grp.columns[0]: "Year", df_yr_month_grp.columns[1]: "Month", df_yr_month_grp.columns[2]: "Count"})
#sns.heatmap(df_yr_month.pivot(index='Year', columns='Month', values='Count'), linewidths=.5)
In the future we would like to derive more information from the reported and occurrence dates. By subtracting them, we can see how long it took to report each crime. We could also use these dates in conjunction with APIs to determine sunset time and infer whether the crime occurred during the day or at night. There is an open Sunrise-Sunset API (https://sunrise-sunset.org/api) that also takes latitude and longitude to get precise times for that geolocation.
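The reporting-delay idea can be sketched with pandas on a couple of stand-in rows (the date strings mirror the CSV's format; treat this as a sketch rather than a tested transformation of the full file):

```python
import pandas as pd

# Toy rows standing in for 'Date Rptd' and 'DATE OCC'
demo = pd.DataFrame({"Date Rptd": ["01/05/2015 12:00:00 AM", "03/02/2016 12:00:00 AM"],
                     "DATE OCC":  ["01/01/2015 12:00:00 AM", "03/01/2016 12:00:00 AM"]})
demo["Date Rptd"] = pd.to_datetime(demo["Date Rptd"])
demo["DATE OCC"] = pd.to_datetime(demo["DATE OCC"])

# Whole days between occurrence and report
demo["Report Delay"] = (demo["Date Rptd"] - demo["DATE OCC"]).dt.days
```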
Another API could be used to see if the crime was committed on a Holiday. To get the most commonly given off Holidays, we could use Calendarific (https://calendarific.com/api-documentation) as they give the option to filter by type of holiday and limit to national holidays.
Some fields could simply be extracted into new transformed fields. We could denote crime occurrences by the day of the week on which they occurred and whether that day was a weekend. Rather than working with victim age as a continuous variable, it may be useful to group it into age buckets. Weapons not mentioned as present are represented as null in the 'Weapon Used Cd' field; this could become a single True/False 'Weapon Used' flag. Similarly, checking for nulls across the crime code fields 1-4 would yield a charge count for a 'Crime Count' field.
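These derived fields could be built roughly as follows. The toy frame uses column names from the data dictionary, but this is a sketch of the idea, not a tested pipeline:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "DATE OCC": pd.to_datetime(["2015-01-03", "2015-01-05"]),  # a Saturday and a Monday
    "Weapon Used Cd": [400.0, np.nan],
    "Crm Cd 1": [510.0, 624.0],
    "Crm Cd 2": [998.0, np.nan],
    "Crm Cd 3": [np.nan, np.nan],
    "Crm Cd 4": [np.nan, np.nan],
})

demo["Day Of Week"] = demo["DATE OCC"].dt.day_name()
demo["Weekend"] = demo["DATE OCC"].dt.dayofweek >= 5          # Sat=5, Sun=6
demo["Weapon Used"] = demo["Weapon Used Cd"].notnull()        # null means no weapon noted
demo["Crime Count"] = demo[["Crm Cd 1", "Crm Cd 2",
                            "Crm Cd 3", "Crm Cd 4"]].notnull().sum(axis=1)
```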
Making use of the MOCodes (Modus Operandi codes) will take some processing to split them into lists and aggregate across the dataset. This could then be used to tally the most frequently occurring codes and dummy-code them as new columns that an incident either has or does not have, rather than adding potentially hundreds of new fields to each row.
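A possible sketch of that MO-code processing, assuming the 'Mocodes' field holds space-separated codes as described in the data dictionary:

```python
import pandas as pd

# Toy frame; each row's 'Mocodes' is a space-separated string of MO codes
demo = pd.DataFrame({"Mocodes": ["0416 0344", "0344", "0913 0344 2000"]})

# Explode into one row per code so frequencies can be tallied across the dataset
codes = demo["Mocodes"].str.split().explode()
top_codes = codes.value_counts().head(2).index

# Dummy-code only the most frequent codes as new boolean columns
for code in top_codes:
    demo[f"MO_{code}"] = demo["Mocodes"].str.split().apply(lambda xs: code in xs)
```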
The reporting districts are given and can be joined to other district-level datasets, which include the total area of each district (https://geohub.lacity.org/datasets/lapd-reporting-districts). We could also find how many cars are assigned to each precinct in the Basic Car Plan (http://www.lapdonline.org/search_results/content_basic_view/6528).
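Such a join could look like the following hypothetical sketch; the district table's 'REPDIST' key and 'AREA_SQMI' column are assumptions standing in for whatever the real attribute table exposes:

```python
import pandas as pd

# Hypothetical reporting-district attribute table (column names are assumptions)
districts_df = pd.DataFrame({"REPDIST": [101, 102], "AREA_SQMI": [1.2, 0.8]})
crimes_demo = pd.DataFrame({"DR_NO": [1, 2, 3], "Rpt Dist No": [101, 102, 101]})

# Left-join district attributes onto each crime record
joined = crimes_demo.merge(districts_df, left_on="Rpt Dist No",
                           right_on="REPDIST", how="left")
```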